The relationship between house prices and neighborhood characteristics has long been recognized. For example, public infrastructure investment, highly qualified block residents, and a low crime rate all benefit adjacent properties. Interrelated urban housing submarkets can together form a unitary urban housing market. Three factors are commonly used to describe submarkets: housing structural attributes, spatial attributes (the housing location), and demander groups. The joint influence of structural and spatial attributes can be treated as a fourth factor.
However, predicting home prices is challenging, because the factors that affect a house's price differ from one area to another. In this project, we modeled housing prices and related factors in the San Francisco area. The models we developed predict the price of homes in San Francisco based on local conditions and characteristics.
We collected data on the internal characteristics of San Francisco houses and on the spatial structure of the surrounding streets. From census data we obtained information on the local population and crime rate. Based on these data, we completed an initial version of our San Francisco house price forecasting model.
We collected our data mainly from two sources: DataSF and Social Explorer. From the former, we mostly collected spatial data such as the distributions of crimes, trees, schools, neighborhood boundaries, restaurants, and parking lots. From the latter, we collected census data for the research boundaries we chose and spatially joined them to those boundaries. We think the characteristics of the people who live in a neighborhood have a strong potential impact on house prices. Finally, we considered spatial lags and neighborhood fixed effects. However, these two variables did not bring much improvement to our model.

## Description of Predictors

The table and plot below present summary statistics and distributions of values for our outcome variable and the 12 predictor variables in our model.

| Variable | Type | Category | Description | Min | Median | Max |
|---|---|---|---|---|---|---|
| Asian | Continuous | Demographic characteristic | percentage of Asians | 0 | 0.1 | 0.3 |
| Bachelor | Continuous | Demographic characteristic | percentage of Bachelors | 0 | 0.2 | 0.3 |
| Master | Continuous | Demographic characteristic | percentage of Masters | 0 | 0 | 0.1 |
| gas_used | Continuous | Demographic characteristic | percentage of gas-used households | 0.1 | 0.7 | 0.8 |
| electricit | Continuous | Demographic characteristic | percentage of electricity-used households | 0 | 0.1 | 0.2 |
| Median_GrI | Continuous | Demographic characteristic | Median gross income | 216 | 961 | 1097 |
| Average_Co | Continuous | Demographic characteristic | Average gross Income | 19 | 29 | 31 |
| pop | Continuous | Demographic characteristic | population of each neighborhood | 231 | 3906 | 5140 |
| pop_den | Continuous | Demographic characteristic | population density of each neighborhood | 318.8 | 15087.2 | 20130.8 |
| medinc | Continuous | Demographic characteristic | median household income | 22265 | 86601 | 95601 |
| avg_hsinc | Continuous | Demographic characteristic | average household income | 24134.7 | 69320.6 | 77834.8 |
| agg_hsinc | Continuous | Demographic characteristic | aggregate household income | 2921900 | 112877700 | 170305200 |
| inc_perca | Continuous | Demographic characteristic | per capita income | 7428 | 21342 | 30367 |
| bycar | Continuous | Demographic characteristic | percentage of people commuting to work by car | 0.1 | 0.6 | 0.6 |
| byfoot | Continuous | Demographic characteristic | percentage of people commuting to work on foot | 0 | 0 | 0 |
| mhh_child | Continuous | Demographic characteristic | number of children | 0 | 44 | 62 |
| med_age | Continuous | Demographic characteristic | median age of neighborhood | 26 | 36 | 39 |
| LotArea | Continuous | Internal characteristic | Property area of Lot (sqft) | 0 | 187500 | 250000 |
| PropArea | Continuous | Internal characteristic | Property area of home (sqft) | 0 | 1150 | 1487 |
| Stories | Continuous | Internal characteristic | Number of stories | 0 | 1 | 1 |
| Rooms | Continuous | Internal characteristic | Number of rooms | 0 | 5 | 6 |
| Beds | Continuous | Internal characteristic | Number of beds | 0 | 0 | 2 |
| Baths | Continuous | Internal characteristic | Number of baths | 0 | 1 | 2 |
| SalePrice | Continuous | Outcome Variable | sale price | 100001 | 695001 | 930003 |
| lagPrice15 | Continuous | Spatial characteristic | Avg price of 15 nearest home sales | 236868.6 | 688622.2 | 969502.1 |
| crime.Buffer | Continuous | Spatial characteristic | Number of crimes within 1/8 mile | 3 | 82 | 109 |
| crime_nn5 | Continuous | Spatial characteristic | Avg distance of 5 nearest crimes | 30.4 | 76.1 | 93.3 |
| rest.Buffer | Continuous | Spatial characteristic | Number of restaurants within 1/8 mile | 4 | 38 | 62 |
| schl_nn5 | Continuous | Spatial characteristic | Avg distance of 5 nearest schools | 87.4 | 374.6 | 472 |
| tree_nn5 | Continuous | Spatial characteristic | Avg distance of 5 nearest trees | 3.2 | 21.3 | 27 |
| bus_nn5 | Continuous | Spatial characteristic | Avg distance of 5 nearest buses | 39.4 | 124.7 | 163.2 |
| parking_nn5 | Continuous | Spatial characteristic | Avg distance of 5 nearest parking lots | 14.1 | 224.6 | 415.9 |
| nbor | Categorical | Spatial characteristic | neighborhood name | | | |
| SaleYr | Categorical | Temporal | Year the house was sold | | | |
| BuiltYear | Categorical | Temporal | Year the house was built | | | |
| BuiltYear | Continuous | NA | NA | 0 | 1913 | 1929 |
| SaleYr | Continuous | NA | NA | 12 | 12 | 13 |
We used a correlation plot to identify pairs of variables with high correlation coefficients (> 0.9). From those pairs, we removed some variables based on their p-values in the summary of our first model. This minimizes collinearity among the variables in our model, since collinear variables can rob each other of predictive power.
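The screening step described above can be sketched as follows. This is a minimal illustration, not the code used in the project: the function names and the toy values are hypothetical, and only the 0.9 threshold comes from the text.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def highly_correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair whose |r| exceeds threshold."""
    flagged = []
    for a, b in combinations(sorted(columns), 2):
        r = pearson(columns[a], columns[b])
        if abs(r) > threshold:
            flagged.append((a, b, r))
    return flagged


# Hypothetical toy values, NOT the project's data.
toy = {
    "medinc": [40, 55, 70, 85, 100],
    "avg_hsinc": [42, 56, 69, 88, 101],
    "tree_nn5": [12, 30, 9, 25, 14],
}
flagged = highly_correlated_pairs(toy)
for name_a, name_b, r in flagged:
    print(f"{name_a} vs {name_b}: r = {r:.3f}")
```

In this toy example only the two income variables are flagged. Which member of a flagged pair to drop is a separate decision; the text bases it on p-values from the first model.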
The average price of the 15 nearest homes (lagPrice15), the average distance to the 5 nearest trees (tree_nn5), median gross income (Median_GrI), and income per capita (inc_perca) were four of our most important variables. Plotting them against sale price shows positive correlations for all of them except tree_nn5.
## Map of Dependent variables (plots)
We created a multiple linear regression model to predict house prices. We first selected predictors that we thought might have an impact on house prices, then used correlation coefficients to drop some predictors and prevent collinearity. Next, we applied simple feature engineering to variables such as the number of bathrooms and the number of bedrooms. We grouped them into several categories, because treating these variables as continuous might not be the best way to improve our model’s accuracy.

Next, we tested whether the model errors exhibit spatial autocorrelation. To do so, we calculated the average price of the 15 nearest houses for each house. The result shows that spatial autocorrelation does exist. The observed Moran’s I of 0.1342575 seems marginal, but the p-value of 0.001 suggests that the model errors exhibit greater spatial autocorrelation than would be expected from random chance alone.

Additionally, prices and errors seem to vary across neighborhoods. We considered that there may be a ‘neighborhood effect’ that can help predict variation in price. We therefore included San Francisco neighborhoods in our model, but this turned out not to help the accuracy of our model much.
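The spatial lag described above (the average price of the 15 nearest sales) could be computed roughly as in the sketch below. It assumes projected x/y coordinates and straight-line distance; the function name and the toy coordinates are hypothetical, and a real data set of this size would normally use a spatial index instead of the brute-force search shown here.

```python
import math


def spatial_lag(points, prices, k=15):
    """Average price of the k nearest other sales, for each sale."""
    lags = []
    for i, (xi, yi) in enumerate(points):
        # Distance from sale i to every other sale, nearest first.
        ranked = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        nearest = [j for _, j in ranked[:k]]
        lags.append(sum(prices[j] for j in nearest) / len(nearest))
    return lags


# Hypothetical toy data: four sales on a line, k reduced to 2 for readability.
points = [(0, 0), (1, 0), (2, 0), (10, 0)]
prices = [100, 200, 300, 1000]
lags = spatial_lag(points, prices, k=2)
print(lags)  # [250.0, 200.0, 150.0, 250.0]
```

Each sale is excluded from its own lag, so the lag never contains the price it is meant to help predict.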
The table below shows the MAE, MAPE, and R-squared values of our model.
On average, our predictions were off by about $250,000, or about 25.4% of the sale price. Our model accounted for 71.5% of the variation in sale price in the test set.
| MAE | MAPE | R-Squared |
|---|---|---|
| 251753 | 25.41 % | 0.7148072 |
The cross-validation results show evaluation statistics comparing our predicted house prices with the observed sale prices. The multiple R-squared is 71.48%, and the adjusted R-squared is 71.22%.
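For reference, the statistics reported above can be computed as in the following sketch. It assumes the usual definitions (R-squared as one minus the ratio of residual to total sum of squares); the function name, the `n_predictors` argument, and the toy numbers are hypothetical and are not taken from the project.

```python
def regression_metrics(observed, predicted, n_predictors):
    """MAE, MAPE, R-squared and adjusted R-squared for a set of predictions."""
    n = len(observed)
    errors = [o - p for o, p in zip(observed, predicted)]

    mae = sum(abs(e) for e in errors) / n
    # MAPE as a fraction of the observed price (multiply by 100 for percent).
    mape = sum(abs(e) / o for e, o in zip(errors, observed)) / n

    mean_obs = sum(observed) / n
    ss_res = sum(e ** 2 for e in errors)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1 - ss_res / ss_tot
    # Adjusted R-squared penalizes the number of predictors in the model.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)

    return {"MAE": mae, "MAPE": mape, "R2": r2, "adjR2": adj_r2}


# Hypothetical toy values, NOT the project's data.
observed = [100, 200, 300, 400]
predicted = [110, 190, 310, 390]
metrics = regression_metrics(observed, predicted, n_predictors=1)
print(metrics)
```

Because the adjusted value shrinks as predictors are added without improving the fit, it is always at most the plain R-squared, which matches the ordering of the two figures in the text.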
We start with a histogram showing the distribution of the error values. From the model results, we can see that the errors are not evenly distributed: for some houses the error is very large. Even so, most of our model’s errors are relatively small.
We also created a density histogram, which makes the distribution of the data easier to see. The figure shows that our predictions still differ from the actual house prices, especially near the average, although the distribution of our predictions is close to a normal distribution. The predictions show a certain amount of deviation. The orange line indicates a perfect fit between observed and predicted prices, and the green line shows the average predicted fit of the model.
The housing price data are spatially correlated.
As the sale price error of a home increases, the errors of nearby homes also increase.
As the sale price of a home rises, the prices of nearby homes also rise.
We calculated a Moran’s I of 0.12634, which indicates slight clustering of residuals in our test set. The p-value of 0.001 indicates that this clustering is greater than would be expected from random chance alone.
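A Moran's I statistic and a permutation-based p-value of this kind can be sketched as below. The sketch assumes row-standardized k-nearest-neighbor weights and 999 random permutations; neither choice is stated in the text, and all function names and toy values are hypothetical.

```python
import math
import random


def knn_neighbors(points, k):
    """For each point, list the indices of its k nearest other points."""
    neighbors = []
    for i, (xi, yi) in enumerate(points):
        ranked = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        neighbors.append([j for _, j in ranked[:k]])
    return neighbors


def morans_i(values, neighbors):
    """Moran's I with row-standardized weights (each neighbor weighs 1/k)."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    # With row-standardized weights the n / sum(weights) factor equals 1.
    cross = sum(
        dev[i] * sum(dev[j] for j in nbrs) / len(nbrs)
        for i, nbrs in enumerate(neighbors)
    )
    return cross / sum(d * d for d in dev)


def permutation_p_value(values, neighbors, permutations=999, seed=0):
    """Share of random reshuffles with Moran's I at least as large as observed."""
    rng = random.Random(seed)
    observed = morans_i(values, neighbors)
    shuffled = list(values)
    at_least = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if morans_i(shuffled, neighbors) >= observed:
            at_least += 1
    return (at_least + 1) / (permutations + 1)


# Hypothetical toy residuals: low values on the left, high values on the right.
points = [(i, 0) for i in range(10)]
residuals = [1, 1, 1, 1, 1, 9, 9, 9, 9, 9]
nbrs = knn_neighbors(points, k=2)
i_obs = morans_i(residuals, nbrs)
p_value = permutation_p_value(residuals, nbrs)
print(i_obs, p_value)
```

Under this formula the smallest p-value obtainable with 999 permutations is 1/1000 = 0.001, which is reached when no reshuffled data set is as clustered as the observed one.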
| nbor | meanPrice | meanPrediction |
|---|---|---|
| Bayview Hunters Point | 524043.7 | 524043.7 |
| Bernal Heights | 1092916.9 | 1092916.9 |
| Castro/Upper Market | 1709029.8 | 1709029.8 |
| Chinatown | 1240001.9 | 1240001.9 |
| Excelsior | 656803.3 | 656803.3 |
| Financial District/South Beach | 1295820.0 | 1295820.0 |
| Glen Park | 1380719.7 | 1380719.7 |
| Haight Ashbury | 1666201.1 | 1666201.1 |
| Hayes Valley | 1269453.6 | 1269453.6 |
| Inner Richmond | 1576612.6 | 1576612.6 |
| Inner Sunset | 1265486.6 | 1265486.6 |
| Japantown | 584501.0 | 584501.0 |
| Lakeshore | 898960.8 | 898960.8 |
| Lincoln Park | 1000002.0 | 1000002.0 |
| Lone Mountain/USF | 1316757.0 | 1316757.0 |
| Marina | 2291989.6 | 2291989.6 |
| McLaren Park | 472552.2 | 472552.3 |
| Mission | 1174369.8 | 1174369.8 |
| Mission Bay | 932948.0 | 932948.0 |
| Nob Hill | 1522764.9 | 1522764.9 |
| Noe Valley | 1757886.1 | 1757886.1 |
| North Beach | 1417865.4 | 1417865.4 |
| Oceanview/Merced/Ingleside | 688476.3 | 688476.3 |
| Outer Mission | 732060.5 | 732060.5 |
| Outer Richmond | 1158357.1 | 1158357.1 |
| Pacific Heights | 2279064.8 | 2279064.8 |
| Portola | 673652.9 | 673652.9 |
| Potrero Hill | 1252597.4 | 1252597.4 |
| Presidio Heights | 2209032.7 | 2209032.7 |
| Russian Hill | 1906431.7 | 1906431.7 |
| Seacliff | 2557626.0 | 2557626.0 |
| South of Market | 885257.9 | 885257.9 |
| Sunset/Parkside | 888651.5 | 888651.5 |
| Tenderloin | 3400002.5 | 3400002.5 |
| Twin Peaks | 1300414.3 | 1300414.3 |
| Visitacion Valley | 565451.3 | 565451.3 |
| West of Twin Peaks | 1261246.5 | 1261246.5 |
| Western Addition | 1129470.7 | 1129470.7 |
| Context | MAE | MAPE |
|---|---|---|
| High Income | 311225.1 | 0.2504901 |
| Low Income | 190969.2 | 0.2613050 |
| High poverty rate | 229464.3 | 0.2754896 |
| Low poverty rate | 271319.8 | 0.2374056 |
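
A per-context comparison like the one in the table above can be produced by grouping the test-set errors before averaging them. The sketch below is an illustration under assumed inputs: the function name, the group labels attached to each sale, and the toy prices are hypothetical.

```python
from collections import defaultdict


def errors_by_group(observed, predicted, groups):
    """MAE and MAPE (as a fraction) computed separately for each group label."""
    buckets = defaultdict(list)
    for obs, pred, group in zip(observed, predicted, groups):
        abs_error = abs(obs - pred)
        buckets[group].append((abs_error, abs_error / obs))

    summary = {}
    for group, rows in buckets.items():
        summary[group] = {
            "MAE": sum(a for a, _ in rows) / len(rows),
            "MAPE": sum(p for _, p in rows) / len(rows),
        }
    return summary


# Hypothetical toy values, NOT the project's data.
observed = [100, 200, 400, 800]
predicted = [110, 180, 440, 720]
groups = ["Low Income", "Low Income", "High Income", "High Income"]
summary = errors_by_group(observed, predicted, groups)
for group, stats in summary.items():
    print(group, stats)
```

In the toy data the high-income group has a larger MAE but the same MAPE, because its prices are higher. This is why MAPE is the fairer measure when comparing contexts with different price levels.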
Although our model is more accurate in some cases, it is not suitable for use. Its accuracy is not sufficient to predict actual prices across the San Francisco area, so it cannot be considered a valid model. While building the model, we found that it became more accurate as we added data, so in the future we hope to collect more. For example, we want to collect data on how major roads relate to housing prices. Most houses with higher prices are clustered together, but this is not obvious in our model. This is probably because our Moran’s I of about 0.13 is close to 0, which indicates that our model accounts for most of the spatial variation in price.
We think our model is not suitable for predicting housing prices in San Francisco, and we would not recommend it to Zillow. In an alternative model, we should use a spatial lag to capture spatial autocorrelation instead of leaving it in the residuals. Log-transforming the data in the OLS model would also be more effective. In a future model we also need to add more relevant variables, which can bring our predictions closer to the true values.
One thing our model lacks is reliable information about the characteristics of the house being analyzed, for example information about the seller and the buyer. This information could help us build better models. More information about housing is therefore needed to better inform our models.